Sora is powerful, but there’s no need to overhype it.
Following the release of OpenAI's text-to-video model Sora, discussion of it has flooded the internet. In this article, the author has compiled 10 key points that may help you better understand Sora's capabilities, background, and impact.
Why is it called Sora? What does it mean? The name Sora comes from the Japanese word for "sky," with connotations of "freedom." On Sora's official website, countless paper airplanes fly freely, drifting together to form a sky-colored background.
Additionally, in Korean "sora" means a conch shell, and in Finnish it means gravel, readily evoking the Nautilus from "Twenty Thousand Leagues Under the Sea" and the sci-fi film "Dune."
According to the tech outlet Silicon Star, "sora" also appears in Japanese expressions meaning "from memory, without looking at any written material"; the derived verb "soranjiru" (そらんじる) means "to recite by heart," which aptly captures what Sora does.
The official explanation is that research team members Tim Brooks and Bill Peebles chose the name because it "evokes unlimited creative potential."
Does reality no longer exist? Just how impressive is Sora? OpenAI's official website features a video of Tokyo street scenes generated by Sora. The prompt was:
“Beautiful, snow-covered Tokyo, the camera moves through bustling city streets, following pedestrians enjoying the beautiful snowy day, some shopping at roadside stalls. Beautiful cherry blossoms dance in the wind with snowflakes.”
Another video was generated based on the following prompt:
"Several huge, furry mammoths approach across the snow-covered ground, their long fur blowing in the wind; in the distance are tall trees and majestic snow-covered mountains, and the afternoon light creates a warm glow."
This shows that Sora makes it possible to generate a video from a single sentence of prompt text. What is astonishing is that, in simulating the physical world, Sora reflects its complexity and diversity with surprising fidelity: given a prompt, Sora "knows" how to tell a story in the language of cinematography.
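To make the "prompt in, video out" workflow concrete, here is a minimal sketch of what such a request might look like programmatically. Note that OpenAI had not published a public Sora API at the time of writing, so the endpoint, parameter names, and response shape below are purely hypothetical illustrations, not a real interface.

```python
import json

# Hypothetical illustration only: there was no public Sora API at the time of
# writing, so the endpoint and field names below are invented solely to show
# the "text prompt in, video out" workflow described above.

HYPOTHETICAL_ENDPOINT = "https://api.example.com/v1/video/generations"


def build_video_request(prompt: str, duration_seconds: int = 20,
                        resolution: str = "1920x1080") -> dict:
    """Package a one-sentence text prompt into a (hypothetical) text-to-video request body."""
    return {
        "model": "sora",               # assumed model identifier
        "prompt": prompt,              # the one-sentence scene description
        "duration": duration_seconds,  # Sora demo clips ran up to about a minute
        "resolution": resolution,
    }


if __name__ == "__main__":
    prompt = ("Beautiful, snow-covered Tokyo, the camera moves through bustling "
              "city streets, following pedestrians enjoying the snowy day.")
    payload = build_video_request(prompt)

    # Dry run: print the would-be request instead of sending it, since the
    # endpoint above does not actually exist.
    print("POST", HYPOTHETICAL_ENDPOINT)
    print(json.dumps(payload, indent=2, ensure_ascii=False))
```

The point of the sketch is only that the entire creative input is one short natural-language description; everything else about the resulting video is left to the model.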
Li Zhifei, founder and CEO of Mobvoi, believes that video, as imagery of the physical world, is the result of rendering a world model. Compared with language data, a model learned from large-scale video data is a "model of a model": it has absorbed many laws of the physical world, and so comes closer to simulating that world.
The difference between text and video is that the former captures human logical thinking, while the latter captures the physical world. If the video generation model Sora can be combined well with a text-based large language model (LLM), it genuinely has the potential to become a general-purpose simulator of the world. If such a system one day learns to drive through complex urban traffic entirely from simulated scenes, no one should be surprised.
This is precisely why many practitioners have exclaimed, "Reality no longer exists."